Add scribe stream command for live microphone transcription#1

Draft
javiertoledo wants to merge 7 commits into main from feature/stream

Conversation

@javiertoledo
Member

Summary

  • Add scribe stream command for live microphone transcription
  • Two engines: default (Parakeet TDT v3, multilingual, ~11s latency) and Nemotron (English-only, ~560ms latency)
  • README updated with streaming docs and engine trade-offs

Status

Draft — streaming not working reliably yet. Known issues:

  • Nemotron engine: output shows mixed/repeated text from accumulated transcript diffing
  • Default engine: gets stuck when switching languages mid-stream
  • Default engine: ~11s latency (inherent to SlidingWindow approach with batch model)
  • No system audio capture yet (mic only)

What works

  • scribe stream starts and captures microphone audio
  • scribe stream --engine nemotron downloads and loads the Nemotron model
  • Partial text preview on stderr
  • Both text and JSONL output formats
  • Model download retry on partial/corrupt cache

Architecture decisions

  • Nemotron 560ms via StreamingAsrEngine protocol (true cache-aware streaming)
  • Parakeet TDT v3 via SlidingWindowAsrManager (batch model in sliding windows)
  • Actor-based state for thread safety (Swift 6 sendability)
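The actor-based state decision can be sketched as follows. This is a minimal illustration, assuming mutable stream state is touched from both the audio callback and the polling loop; `TranscriptBuffer` and its members are illustrative names, not the PR's actual types.

```swift
import Foundation

// Shared mutable state behind an actor: the audio callback and the poll
// loop both `await` into it, so Swift 6 sendability checking is satisfied
// without locks.
actor TranscriptBuffer {
    private var lines: [String] = []

    func append(_ line: String) {
        lines.append(line)
    }

    func snapshot() -> [String] {
        lines
    }
}
```

Actor isolation serializes all access, which is why no explicit synchronization appears anywhere else in the sketch.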

Test plan

  • Nemotron: speak English continuously, verify clean incremental output
  • Default: speak Spanish, verify transcription appears
  • Default: speak English then Spanish, verify no hang
  • --format jsonl produces valid JSON per line
  • --output file.txt saves to file
  • Ctrl+C exits cleanly
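For the JSONL checks, one line per event might look like the sketch below. The field set (`text`, `confirmed`, `timestamp`) is inferred from the PR description, not the actual schema.

```swift
import Foundation

// Hypothetical shape of one JSONL event; encode with JSONEncoder and
// print one object per line.
struct StreamEvent: Codable {
    let text: String
    let confirmed: Bool
    let timestamp: Double
}

let event = StreamEvent(text: "hello world", confirmed: true, timestamp: 1.25)
let data = try JSONEncoder().encode(event)
print(String(data: data, encoding: .utf8)!)
```

Each line being independently decodable is what makes the `--format jsonl` output safe to pipe into line-oriented tools.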

🤖 Generated with Claude Code

javiertoledo and others added 7 commits March 30, 2026 20:29
New command: scribe stream — captures microphone audio and transcribes
in real-time using FluidAudio's SlidingWindowAsrManager (Parakeet).

Features:
- Live transcription from microphone with timestamps
- Text and JSONL output formats
- Save to file with --output
- Ctrl+C to stop cleanly
- Uses streaming ASR config (11s chunks, 1s hypothesis updates)

Usage:
  scribe stream                      # listen and transcribe
  scribe stream --format jsonl       # JSONL output
  scribe stream --output meeting.txt # save to file

System audio capture (--source) will be added in a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reduce chunk size from 11s to 3s for ~3-4s latency (was ~13s)
- Lower confirmation threshold from 0.8 to 0.5 for faster output
- Reduce right context from 2s to 0.5s
- Fix speaker label: remove "Others" tag for mic input
- Add text dedup to avoid repeating same hypothesis
- Remove --mic flag (mic is default and only source for now)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pdates

The 3s chunk config was too short for Parakeet — model needs ~10s context.
Reverted to the library's .streaming preset (11s chunks, 1s hypothesis).

Now shows two types of updates:
- Volatile (hypothesis): shown as ephemeral line on stderr with \r overwrite
  Gives immediate ~1-2s feedback while speaking
- Confirmed: printed as permanent line to stdout
  Stable, final text after sufficient context

Also fixes:
- Stream getting stuck on longer utterances (was breaking model state)
- Text format shows live preview on stderr, final on stdout
- JSONL emits both volatile and confirmed (with "confirmed" field)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
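The volatile/confirmed split described above can be sketched like this: hypothesis text is overwritten in place on stderr, while confirmed text becomes a permanent stdout line. Function names here are illustrative, not the PR's actual API.

```swift
import Foundation

// Ephemeral hypothesis: carriage return moves to column 0 so the next
// hypothesis overwrites this one; no trailing newline keeps it volatile.
func showVolatile(_ hypothesis: String) {
    FileHandle.standardError.write(Data(("\r" + hypothesis).utf8))
}

// Stable text: clear the ephemeral stderr line (ANSI erase-to-end-of-line),
// then print a permanent line to stdout.
func emitConfirmed(_ text: String) {
    FileHandle.standardError.write(Data("\r\u{1B}[K".utf8))
    print(text)
}
```

Keeping the preview on stderr means `scribe stream > out.txt` captures only confirmed text.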
Replace SlidingWindowAsrManager (batch TDT in sliding windows, ~11s latency)
with StreamingAsrEngine protocol using Nemotron 560ms:

- True cache-aware streaming: each 560ms chunk inherits full context
- 2.12% WER (better than TDT v3's 2.5% on LibriSpeech)
- Includes punctuation and capitalization
- ~560ms to first text (was ~11s)
- Partial transcript callback for live preview on stderr
- Confirmed text printed to stdout

Architecture:
- Mic audio → appendAudio() → processBufferedAudio() → getPartialTranscript()
- Partial callback fires on every chunk for live preview (\r overwrite on stderr)
- Main loop polls at 20Hz, emits new confirmed text to stdout
- Actor-based state management for thread safety (Swift 6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
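The mic → append → poll flow above can be sketched roughly as follows. Everything here is illustrative: only the names `appendAudio` and `getPartialTranscript` come from the PR text, and the real FluidAudio signatures may differ.

```swift
import Foundation

// Stand-in for the streaming engine: buffers audio, exposes the
// accumulated transcript so far.
actor MockStreamingEngine {
    private var transcript = ""

    func appendAudio(_ samples: [Float]) {
        // Real engine: buffer samples and decode per 560 ms chunk.
    }

    func getPartialTranscript() -> String {
        transcript
    }
}

// Poll at ~20 Hz and emit only transcript text not printed before.
func pollLoop(engine: MockStreamingEngine) async {
    var printed = 0
    while !Task.isCancelled {
        let full = await engine.getPartialTranscript()
        if full.count > printed {
            print(String(full.dropFirst(printed)), terminator: "")
            printed = full.count
        }
        try? await Task.sleep(nanoseconds: 50_000_000) // 50 ms ≈ 20 Hz
    }
}
```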
- Default: Parakeet TDT v3 via SlidingWindow (25 languages, higher latency)
- --engine nemotron: Nemotron 560ms (English-only, ~560ms latency, punctuation)

Usage:
  scribe stream                    # multilingual (default)
  scribe stream --engine nemotron  # English-only, low latency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Nemotron: retry with cache cleanup on failed model load (fixes partial download)
- Both engines: show download progress messages (not just --verbose)
- README: add streaming section with engine comparison and trade-offs
- README: update performance table with streaming latencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
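The "retry with cache cleanup" behavior might look roughly like this sketch; `load` and `cleanup` are placeholders for the real model loader and cache wipe, which live in the actual CLI code.

```swift
import Foundation

// Generic retry-once helper: if the first load throws (assumed here to
// mean a partial or corrupt cached download), wipe the cache and retry
// once with a clean download.
func loadWithRetry<Model>(
    load: () async throws -> Model,
    cleanup: () throws -> Void
) async throws -> Model {
    do {
        return try await load()
    } catch {
        try? cleanup()
        return try await load()
    }
}
```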
The Nemotron engine's partial callback returns the full accumulated
transcript each time, which grows and revises. The previous code tried
to diff via getPartialTranscript() polling, causing repeated/mixed output.

Fix: Track printed length in StreamState actor. The partial callback
fires after each 560ms chunk — we diff to find only the new portion
and emit that. Live preview shows the tail of the transcript on stderr
(ephemeral, overwritten). New confirmed text goes to stdout.

Also simplified SlidingWindow engine to only emit to stdout on confirmed
text (volatile goes to stderr preview only).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
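The printed-length tracking described in this fix can be sketched as an actor that remembers how much of the accumulated transcript has already been emitted and returns only the new suffix. `StreamState` here follows the name in the commit message, but the body is illustrative.

```swift
import Foundation

actor StreamState {
    private var printedLength = 0

    // Called from the partial callback with the full accumulated
    // transcript; returns only the portion not yet emitted.
    func newPortion(of fullTranscript: String) -> String {
        guard fullTranscript.count > printedLength else { return "" }
        let suffix = String(fullTranscript.dropFirst(printedLength))
        printedLength = fullTranscript.count
        return suffix
    }
}
```

Note this assumes the transcript only grows; if the engine revises an already-emitted prefix, a pure length diff can still emit stale text, which matches the known-issue list above.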